Report

Applied Statistics course HdM 2023

1 Setup

Code
import pandas as pd
import numpy as np
import missingno as mno # needed to visualize missing values. install missingno into conda if import does not work!
import altair as alt
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px # needed for US map
import xlrd # needed to read excel files. install xlrd into conda if import does not work!
import shutil # needed to copy files
import time
import datetime
import warnings
import joblib
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

warnings.simplefilter(action='ignore', category=FutureWarning)
alt.data_transformers.disable_max_rows()

2 Introduction and data

The aim of this project is to investigate whether there is a correlation between household income and the death rate in the United States of America. In order to explore this relation, we gathered data on both topics and will analyse how and to what extent the death rate is impacted by household income.

2.1 Research Question

We want to answer the following question:

Does the household income have an impact on the death rates in the U.S. and if yes, how big is it?

Our hypothesis regarding the research question is:

The household income and the death rate will have a negative correlation.

That is, the higher the household income, the lower the death rate.

The predictor variable will be the median household income. The main response variable will be the age-adjusted death rate. Further insights can be gained by using categories like death cause, state or year. Other useful information will be provided by the total number of deaths. The data dictionary below shows more details about the required variables.

You can check the appendix for additional information regarding the research question.

  Name                     Description                                       Role       Type                Format
0 state                    the U.S. state where data was collected           predictor  nominal             category
1 year                     considered years (1999 - 2017)                    predictor  numeric discrete    date
2 median_household_income  median household income                           predictor  numeric continuous  float
3 cause name               the generic name for the death cause              predictor  nominal             category
4 113 cause name           NDI ICD-10 113 categories for causes of death     Not used   nominal             category
5 deaths                   count of the total deaths                         response   numeric discrete    int
6 Age-adjusted Death Rate  standardized ratio of deaths per 100k population  response   numeric continuous  float

3 Data

3.1 Import data

3.2 Data structure

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10868 entries, 0 to 10867
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Year                     10868 non-null  int64  
 1   113 Cause Name           10868 non-null  object 
 2   Cause Name               10868 non-null  object 
 3   State                    10868 non-null  object 
 4   Deaths                   10868 non-null  int64  
 5   Age-adjusted Death Rate  10868 non-null  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 509.6+ KB
Year 113 Cause Name Cause Name State Deaths Age-adjusted Death Rate
0 2017 Accidents (unintentional injuries) (V01-X59,Y8... Unintentional injuries United States 169936 49.4
1 2017 Accidents (unintentional injuries) (V01-X59,Y8... Unintentional injuries Alabama 2703 53.8
2 2017 Accidents (unintentional injuries) (V01-X59,Y8... Unintentional injuries Alaska 436 63.7
3 2017 Accidents (unintentional injuries) (V01-X59,Y8... Unintentional injuries Arizona 4184 56.2
4 2017 Accidents (unintentional injuries) (V01-X59,Y8... Unintentional injuries Arkansas 1625 51.8

In the death dataset we have 10868 cases and 6 columns.

Table 102.30. Median household income, by state: Selected years, 1990 through 2017 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 ... Unnamed: 22 Unnamed: 23 Unnamed: 24 Unnamed: 25 Unnamed: 26 Unnamed: 27 Unnamed: 28 Unnamed: 29 Unnamed: 30 Unnamed: 31
0 [In constant 2017 dollars. Standard errors app... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 State 1990\1\ 2000\2\ 2005.0 NaN 2010.0 NaN 2013.0 NaN 2014.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1 2 3 4.0 NaN 5.0 NaN 6.0 NaN 7.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 United States ........... 57500 62000 58200.0 80.0 56400.0 40.0 55100.0 40.0 55600.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Alabama .................... 45200 50400 46400.0 400.0 45600.0 320.0 45200.0 410.0 44400.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 32 columns

3.3 Data corrections

Income Dataset

From the overview, we have seen that the df_income dataset needs to be cleaned:

  • define column names
  • remove columns and rows with only null values
  • remove unnecessary characters such as whitespaces or trailing dots
  • transform the whole dataset by melting the data to only retain 3 columns
  • rename the new columns to be lowercase
  • declare correct column types
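The cleaning steps listed above can be sketched as follows. Note this is a hypothetical mini-version: the toy frame and its column labels are stand-ins for the real Excel import.

```python
import pandas as pd

# Hypothetical mini-version of the raw income data; the real column
# labels come from the Excel header rows and differ from these.
df_income = pd.DataFrame({
    "State": ["Alabama ....", "Alaska ...."],
    "1990": [45200, 79300],
    "2000": [50400, 63600],
})

# remove unnecessary characters such as whitespaces and trailing dots
df_income["State"] = df_income["State"].str.replace(r"[\s.]+$", "", regex=True)

# melt the wide year columns so only 3 columns remain
df_long = df_income.melt(id_vars="State",
                         var_name="year",
                         value_name="median_household_income")

# rename the new columns to be lowercase and declare correct types
df_long = df_long.rename(columns={"State": "state"})
df_long["state"] = df_long["state"].astype("category")
df_long["year"] = df_long["year"].astype("int32")
df_long["median_household_income"] = df_long["median_household_income"].astype("float64")
```

The output below shows the structure the real dataset has after these steps.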
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 468 entries, 0 to 467
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   state                    468 non-null    category
 1   year                     468 non-null    int32   
 2   median_household_income  468 non-null    float64 
dtypes: category(1), float64(1), int32(1)
memory usage: 8.5 KB
state year median_household_income
0 United States 1990 57500.0
1 Alabama 1990 45200.0
2 Alaska 1990 79300.0
3 Arizona 1990 52700.0
4 Arkansas 1990 40500.0

Death Dataset

From the overview of the death dataset, we see the following things:

  • The columns 113 Cause Name, Cause Name and State need their dtype changed to category
  • The year, deaths and death rate columns already have the right type
  • The column names need to be adjusted to be lowercase and have underscores instead of spaces
  • Also, there are no missing values present in the dataset

Joined Dataset

Now we are able to add the median household income from the income dataset to the death dataset by joining the corresponding value on year and state.
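A minimal sketch of such a join with pandas merge on year and state; the toy frames and their values are stand-ins, not the real datasets:

```python
import pandas as pd

# Stand-in frames for the cleaned death and income datasets
df_deaths = pd.DataFrame({
    "year": [2017, 2017, 2016],
    "state": ["Alabama", "Alaska", "Alabama"],
    "deaths": [2703, 436, 2755],
})
df_income = pd.DataFrame({
    "year": [2017, 2017],
    "state": ["Alabama", "Alaska"],
    "median_household_income": [48100.0, 73200.0],
})

# left join keeps every death record; rows without a matching
# year/state combination get NaN and are imputed later
df = df_deaths.merge(df_income, on=["year", "state"], how="left")
```

The left join is what produces the partially filled median_household_income column seen below.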

We take a look at the joined dataset to see if we need to do anything before using it:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10868 entries, 0 to 10867
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   year                     10868 non-null  int64   
 1   113_cause_name           10868 non-null  category
 2   cause_name               10868 non-null  category
 3   state                    10868 non-null  category
 4   deaths                   10868 non-null  int64   
 5   age_adjusted_death_rate  10868 non-null  float64 
 6   median_household_income  4576 non-null   float64 
dtypes: category(3), float64(2), int64(2)
memory usage: 459.6 KB

Imputed Dataset

We will impute the missing values using KNN imputation, since that typically results in a good imputation for numerical values. The data needs to be scaled in order for the algorithm to perform well.

After the imputation, we’ll have to use the inverse_transform() function from MinMaxScaler to bring the scaled dataset back in the original form.
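The scale, impute, inverse-transform sequence can be sketched like this; the toy values are made up and only illustrate the mechanics:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy frame with gaps in the income column (made-up values)
df = pd.DataFrame({
    "year": [1999, 2000, 2005, 2010, 2013],
    "median_household_income": [np.nan, 50400.0, 46400.0, 45600.0, np.nan],
})

# scale first so both features contribute comparably to the distances
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)

# n_neighbors=1 copies the value of the single closest data point
imputer = KNNImputer(n_neighbors=1)
imputed = imputer.fit_transform(scaled)

# bring the scaled dataset back into its original form
df_imputed = pd.DataFrame(scaler.inverse_transform(imputed), columns=df.columns)
```

With n_neighbors=1 each missing income is copied from the row with the closest year, which is exactly the behaviour discussed in the imputation analysis in the appendix.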

In the appendix, you can find the imputation analysis and statistics in detail.

3.4 Variable lists

The list of used variables with a detailed data dictionary can be found in the table under introduction.

3.5 Data splitting

4 Analysis

We will focus on the correlation between the median income and the age-adjusted death rate over all death causes summarized. Although we suspect the death cause, year and state to drive variation on a more detailed level, that analysis would be beyond the scope of this project.

To test our hypothesis we will inspect summary statistics and use different visualizations in order to understand the relations between the predictor and response variables and gain further knowledge.

First we copy the training dataframe. This way we prevent the training data from being changed during data exploration:
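A minimal sketch of this step; df_train is a stand-in name for the real training dataframe:

```python
import pandas as pd

# stand-in for the real training dataframe
df_train = pd.DataFrame({"deaths": [2703, 436],
                         "age_adjusted_death_rate": [53.8, 63.7]})

# .copy() creates an independent frame, so exploration steps
# cannot mutate the training data
df_explore = df_train.copy()
df_explore["deaths"] = 0  # changes only the copy
```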

4.1 Descriptive statistics

Let’s take a look at the summary statistics for the exploration dataframe:

count mean std min 25% 50% 75% max
year 7607.0 2007.97 5.47 1999.0 2003.0 2008.0 2013.0 2017.0
deaths 7607.0 14941.20 108968.33 21.0 616.0 1734.0 5802.5 2813503.0
median_household_income 7607.0 57785.01 9368.38 40000.0 50400.0 55800.0 63400.0 82400.0
age_adjusted_death_rate 7607.0 126.77 221.84 2.6 19.2 36.0 153.2 1061.2
The interquartile ranges (third quartile minus first quartile) for each variable are:

year                          10.0
deaths                      5186.5
median_household_income    13000.0
age_adjusted_death_rate      134.0
dtype: float64

The summary statistics for the total number of deaths are similar to the death rate regarding the difference between minimum and maximum values as well as the range from the third quartile to the maximum. Our first interpretation is that the summarized values over every cause of death (the category ‘All causes’) could cause the effect seen in the table above.

We will check those interpretations in the next segment (exploratory data analysis) and provide a more detailed interpretation for the age adjusted death rate as well as the median household income.

4.2 Exploratory data analysis

The distribution for the age adjusted death rate looks like this:


The statistics and distribution for the age-adjusted death rate show:

  • 75 % of the values are at or below 153.2 while the maximum goes up to 1061.2. With an IQR of 134, this is a heavily right-skewed distribution with a lot of outliers.
  • The distribution visualizes this effect but does not explain it yet. It also shows that it is bimodal.
  • The standard deviation shows that the values differ a lot from the mean, which can also be explained by the skew.

We need to take a more detailed look to correctly interpret this distribution:

Here we can see the reason for the distribution: the cause_name feature contains individual causes as well as a summary over all causes (‘All causes’) within the same column. We will make another dataframe filtered to ‘All causes’ to serve as a summary dataset:

Let’s look at the statistics and distribution of the age-adjusted death rate again:

count mean std min 25% 50% 75% max
age_adjusted_death_rate 675.0 800.13 98.12 572.0 724.4 784.5 869.5 1061.2

We can see that the distribution has changed:

  • The distribution is unimodal and still right skewed, but a lot closer to a normal distribution.
  • We no longer observe the wide range of values as before; there are no more outliers present.
  • The mean is still higher than the median, which can be explained by the number of high values around 900-1000.

Next we visualize the distribution for the median household income:


The summary statistics for the median household income in combination with the distribution show:

  • The distribution is unimodal and right skewed, extending from 40k $ to over 80k $.
  • The median is found at 55800 $ and with an IQR of 13000 there are no outliers present in the income data.
  • The median is not in the middle between minimum and maximum (the middle would be around where the third quartile lies), which again shows that the distribution is right skewed.

The imputed data is very close to the original data present in the income dataset. A comparison for the distribution before and after the imputation can be found in the appendix.

To visualize the relation between age-adjusted death rate and median household income, we analyze scatterplots:

There seems to be a moderate to strong negative correlation between the two variables. We suspect the correlation to not be linear which we will investigate later.

Additional data exploration can be found in the appendix.

4.3 Relationships

We will first look at the correlation of all features and then inspect the correlation for our summarized dataframe.

  year 113_cause_name cause_name state deaths median_household_income age_adjusted_death_rate
year 1.000000 0.005480 0.005480 0.002946 0.013353 -0.077663 -0.027888
113_cause_name 0.005480 1.000000 1.000000 0.007820 0.097992 -0.015742 0.402470
cause_name 0.005480 1.000000 1.000000 0.007820 0.097992 -0.015742 0.402470
state 0.002946 0.007820 0.007820 1.000000 0.004487 -0.023781 0.005450
deaths 0.013353 0.097992 0.097992 0.004487 1.000000 0.005334 0.224213
median_household_income -0.077663 -0.015742 -0.015742 -0.023781 0.005334 1.000000 -0.031676
age_adjusted_death_rate -0.027888 0.402470 0.402470 0.005450 0.224213 -0.031676 1.000000

We can see that both features 113_cause_name and cause_name provide the same information. As predicted, we only need to keep one column for the model. Therefore we will drop the column ‘113_cause_name’.

We can ignore the obvious correlations between the age_adjusted_death_rate and cause_name, or deaths as well as deaths and cause_name, since those do not provide extra information.
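Computing a correlation matrix over categorical columns requires a numeric encoding first. One way to do this (an assumption about how the matrix above was produced, with made-up toy values) is to use the integer category codes:

```python
import pandas as pd

# Toy frame mixing a categorical and two numeric columns
df = pd.DataFrame({
    "state": pd.Categorical(["Alabama", "Alaska", "Alabama", "Arizona"]),
    "deaths": [2703, 436, 2755, 4184],
    "age_adjusted_death_rate": [53.8, 63.7, 55.1, 56.2],
})

# replace each category by its integer code so df.corr() can use it
df_num = df.copy()
for col in df_num.select_dtypes("category").columns:
    df_num[col] = df_num[col].cat.codes

corr = df_num.corr()
```

Correlations on arbitrary category codes should be read with care, since the code ordering is essentially arbitrary; that is one reason the state and cause columns show only weak values in the matrix above.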

For the summarized dataframe, the correlation matrix looks like this:

  year deaths median_household_income age_adjusted_death_rate
year 1.000000 0.033989 -0.078670 -0.446958
deaths 0.033989 1.000000 0.005858 -0.045224
median_household_income -0.078670 0.005858 1.000000 -0.512797
age_adjusted_death_rate -0.446958 -0.045224 -0.512797 1.000000

The correlation matrix shows a moderate negative correlation between median_household_income and age_adjusted_death_rate (-0.51). That means our hypothesis is supported by the data. In addition, year and age_adjusted_death_rate are negatively correlated (-0.45): the more years pass, the fewer people die (looking at data from 1999-2017), which could be explained by improvements in the medical sector as well as the standard of living. To verify this assumption we would need further analysis and possibly more data, but this is not within the scope of our project.

No other significant correlation is visible.

5 Methodology

6 Model

Since the predictor and the response variable are numeric and we try to find a pattern between them, we have a regression problem. We will start with simple linear regression even though we do not assume a linear relationship. In addition we use lasso regression and polynomial regression and compare the models to see which performs better.

We will only use the category ‘All causes’ and filter the death rates by it, because we focus on the total death rate in relation to the median household income.

Before we start, we encode the categorical features as numeric features for the models and perform the data splitting again since we made changes. We also drop the year column, since we are not evaluating a time series, and the deaths column, since it is already factored into the death rate.
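These preparation steps can be sketched as follows; the toy frame and its values are stand-ins for the real joined dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the joined, imputed dataset
df = pd.DataFrame({
    "year": [2016, 2016, 2017, 2017],
    "cause_name": pd.Categorical(["All causes"] * 4),
    "state": pd.Categorical(["Alabama", "Alaska", "Alabama", "Alaska"]),
    "deaths": [52000, 4300, 52500, 4400],
    "median_household_income": [46400.0, 73200.0, 48100.0, 74400.0],
    "age_adjusted_death_rate": [920.2, 715.1, 917.4, 710.9],
})

# keep only 'All causes', drop year and deaths, encode the category
df = df[df["cause_name"] == "All causes"].drop(columns=["year", "deaths", "cause_name"])
df["state"] = df["state"].cat.codes

# re-split into predictors and response
X = df.drop(columns="age_adjusted_death_rate")
y = df["age_adjusted_death_rate"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```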

6.1 Linear Regression

Select model

Training and validation

Fit model

LinearRegression()

Evaluation on test set

Save metrics for comparison

Further analysis of the linear regression model outputs can be found in the appendix.

6.2 Lasso Regression

First we need to standardize our numerical features, in order to use lasso regression:

Select model

Next we let the algorithm choose the alpha (the regularization strength, which indirectly controls how many features remain in the model) that provides the best results.
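A sketch of this alpha search with LassoCV on synthetic data; since the data here is made up, the selected alpha will differ from the roughly 0.765 found on the real data:

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

# Synthetic standardized-looking data with one informative feature
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

# LassoCV tries a grid of alphas with 5-fold cross validation
search = LassoCV(cv=5, random_state=42).fit(X, y)

# refit a plain Lasso with the selected regularization strength
model = Lasso(alpha=search.alpha_).fit(X, y)
```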

Lasso(alpha=0.7649445649117792)

Training and validation

Fit model

Lasso(alpha=0.7649445649117792)

Evaluation on test set

Save metrics

Further analysis of the lasso regression model outputs can be found in the appendix.

6.3 Polynomial Regression

Select model

For polynomial regression we need to expand the features into polynomial terms with fit_transform()
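A sketch of the expansion and fit on toy data; degree 2 is an assumption based on the quadratic fit described in the results section:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy single-predictor data following a perfect quadratic
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 4.0, 9.0, 16.0])

# expand x into the polynomial terms x and x^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# an ordinary linear regression on the expanded features
model = LinearRegression().fit(X_poly, y)
```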

Fit model

LinearRegression()

Evaluation on test set

Save metrics

6.4 Model comparison

Model R2 MSE RMSE MAE
0 Linear Regression 0.222 7051.781 83.975 68.649
1 Lasso Regression 0.222 7052.469 83.979 68.616
2 Polynomial Regression 0.317 6191.236 78.684 63.386

We can see that the Polynomial Regression model performs the best given our data.
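The comparison metrics above can be computed from test-set predictions like this; the numbers here are toy values, not our results:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Toy test-set values and predictions
y_test = np.array([800.0, 750.0, 900.0, 820.0])
y_pred = np.array([780.0, 770.0, 880.0, 840.0])

r2 = r2_score(y_test, y_pred)          # share of explained variance
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                    # error in the units of the target
mae = mean_absolute_error(y_test, y_pred)
```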

The next section gives a more in-depth analysis of the model results.

6.5 Save model

['../models/reg_las_model.pkl_2023-01-10-20_18_26']
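A sketch of how such a timestamped model file can be written and read back with joblib; the path and naming scheme here are assumptions based on the output above:

```python
import datetime
import joblib
from sklearn.linear_model import LinearRegression

# a trivially fitted model as a stand-in for the real one
model = LinearRegression().fit([[0.0], [1.0]], [0.0, 1.0])

# append a timestamp so saved models do not overwrite each other
stamp = datetime.datetime.now().strftime("%Y-%m-%d-%H_%M_%S")
path = f"reg_model.pkl_{stamp}"
joblib.dump(model, path)

# the model can be restored later for predictions
reloaded = joblib.load(path)
```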

7 Results

We have seen that polynomial regression performs the best given our data.

Model R2 MSE RMSE MAE
0 Polynomial Regression 0.317 6191.236 78.684 63.386

Although these metrics are overall the best out of the three models that we ran, there is still a lot of room left for improvement.

  • The error values (MSE, RMSE, MAE) show that the data spreads a lot around the fitted line.
  • R² is a measure of how well the model fits the data, more specifically how much of the variation of the outcome variable (in our case the age-adjusted death rate) can be explained by the predictor (the median household income).
  • An R² score of 0.317 means that 31.7 % of the variation of the age-adjusted death rate can be explained by solely observing the median household income.

For a variable as complex as death rates we consider this to be at least a decent model when only taking the median household income as the sole predictor into account.

The plot underneath shows the final fit made by the polynomial regression. Ignoring the jagged line (a smoothing failure on our part), it depicts a quadratic fit to the data.

It could be that a higher polynomial is a better fit when observing the relation between income and death rate, or that the death rate is too complex to only be described by one variable (which we strongly assume).

Plot

8 Discussion + Conclusion

We have seen the following:

The correlation matrix in the relationships section as well as the scatterplot in the data exploration showed that our hypothesis from the beginning is supported by the data. We stated: the household income and the death rate will have a negative correlation.

There is also a moderate negative correlation between year and the age-adjusted death rate. The data shows that the more years pass, the fewer people die (looking at data from 1999-2017). We assumed that this could be explained by improvements in the medical sector as well as the standard of living. We did not verify this assumption.

Even though the polynomial regression model performed best of the three, it is not a good model in absolute terms because of the low R² and the high error values for MSE/MAE/RMSE. Improvement could be made by trying out more models or adding more features (feature engineering / model ensembling).

With the models we used, our research question can already be answered with a yes:

Does the household income have an impact on the death rates in the U.S. and if yes, how big is it?

The impact is quantified by the metric R², and we consider a value of 0.317 to be substantial enough to answer with a yes.

8.1 Outlook

We have seen a notable correlation between the age-adjusted death rate and the years. This could be further analyzed and additional time series models could be built. It could also be interesting to carry out model fits at the level of individual years and compare the metrics with each other, in order to be able to make statements over the years.

Another option is to gather more data, perform feature engineering and build model ensembles to improve overall performance of the regression models.

For further analysis more data could be gathered to fill the gaps that we imputed in order to gain more realistic data.

In the appendix we made a scatterplot for every death cause related to the median household income, which shows that in certain states certain death causes are more prominent than others. So the correlation between death rate and median household income could be interesting with regard to the specific death cause. Perhaps, for certain causes of death, the rate even increases with increasing income.

9 Appendix

Not structured yet - additional information that may be used

9.1 Additional Information regarding the research question

Our research question is backed by the following studies:

  • KINGE, Jonas Minet, et al. Association of household income with life expectancy and cause-specific mortality in Norway, 2005-2015. JAMA, 2019, Vol. 321, No. 19, pp. 1916-1925. (https://jamanetwork.com/journals/jama/article-abstract/2733322)
  • KAPLAN, George A., et al. Inequality in income and mortality in the United States: analysis of mortality and potential pathways. BMJ, 1996, Vol. 312, No. 7037, pp. 999-1003. (https://www.bmj.com/content/312/7037/999.full)
  • O’CONNOR, Gerald T., et al. Median household income and mortality rate in cystic fibrosis. Pediatrics, 2003, Vol. 111, No. 4, pp. e333-e339. (https://publications.aap.org/pediatrics/article-abstract/111/4/e333/63113/Median-Household-Income-and-Mortality-Rate-in)

Although the first study was done in Norway and the second study investigates mortality instead of the death rate, we expect to make similar observations.

Added information on mortality rate:

Mortality refers to susceptibility to death. While the crude death rate is the number of deaths in a population in a year, the mortality rate is the number of deaths per thousand people over a period of time, normally a year. (see: https://www.differencebetween.com/difference-between-death-rate-and-vs-mortality-rate/)

9.2 Imputation Analysis


After joining, the median_household_income is now an additional column for the death data set.

We only have the median household income for the years 1990, 2000, 2005, 2010 and 2013-2017. In the death dataset, we find the years 1999 through 2017. That is the reason why we only have 4576 non-null values for the median household income: roughly 58 % of the column is empty, and we need to either fill (impute) or remove these values.
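The missing share per column can be quantified like this; the toy frame only illustrates the computation with a similar ratio:

```python
import numpy as np
import pandas as pd

# Toy frame with a missing-value ratio similar to the one described
df = pd.DataFrame({
    "deaths": [100, 200, 300, 400, 500],
    "median_household_income": [50000.0, np.nan, np.nan, 61000.0, np.nan],
})

# isna() marks missing cells; the mean gives the fraction per column
missing_share = df.isna().mean()
```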

Since removing would cut our dataset by more than half, that is not a viable option.

First, let’s take a look at the summary statistics for the median household income and its distribution.

9.3 Summary Statistics

count mean std min 25% 50% 75% max
median_household_income 4576.0 58179.33 9580.27 40000.0 51050.0 56350.0 64000.0 82400.0
The interquartile range is:

median_household_income    12950.0
dtype: float64

Distribution

Plot: Distribution of median household income before imputation

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10868 entries, 0 to 10867
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   year                     10868 non-null  int64  
 1   113_cause_name           10868 non-null  object 
 2   cause_name               10868 non-null  object 
 3   state                    10868 non-null  object 
 4   deaths                   10868 non-null  int64  
 5   age_adjusted_death_rate  10868 non-null  float64
 6   median_household_income  4576 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 594.5+ KB

The distribution is unimodal and right skewed extending from 40k $ to over 80k $.

The median is found at 56350 $ and with an IQR at 12950 there are no outliers present in the income data.

Also the values for the years 1999, 2001-2004, 2006-2009, 2011 & 2012 are missing.

9.4 Imputed dataset

Let us take a look what the imputation has done for the missing values and how we can use the result:

Summary statistics

mean std min 50% max
median_household_income 58179.326923 9580.271444 40000.0 56350.0 82400.0
income_KNN_Scaled 57765.145381 9375.632377 40000.0 55800.0 82400.0
The interquartile ranges before and after imputation are:

median_household_income    12950.0
income_KNN_Scaled          13000.0
dtype: float64

Distribution

Plot: Distribution of median household income after imputation

From the summary statistics we can see that the imputed values represent the original values quite well; they are close to the existing ones. The IQR is slightly greater, but there are still no outliers after the imputation. This can be explained by the method we used: the KNN algorithm imputes values by calculating distances to existing neighbouring data points for every instance we want to replace.

We found that by setting the number of neighbors (n_neighbors) to 1, the imputed data represents the original data best without making further assumptions. Since we are no domain experts, we want to change the data as little as possible.

The scatterplot shows that the missing values for median household income have been copied from years with existing values. For example, the years 1999 - 2002 all share the values from 2000, 2003 - 2007 the values from 2005 and 2008 - 2011 the values from 2010; 2012 has been copied from 2013. This has the advantage that it maintains the spread in the data. The disadvantage is that we make the implicit assumption that the median household income by state stayed the same for the years close to 2000, 2005 and 2010.

This is visible by comparing the boxplots before and after imputation which are very similar.

In reality the median income almost certainly did not stay the same, but for our case this gives a decent estimation for training our model.

This will leave us with 6 columns in total which we can use for our data analysis (we will later disregard the 113 Cause Name column since, as we will see, the same information is provided by the Cause Name column):

9.5 Additional data exploration

The scatterplots for the relation between median household income and age-adjusted death rate for every single category looks like this:

These charts show all death causes in detail, one chart per death cause. The same negative correlation as in the plot for all death causes is visible.

It seems that the death cause only slightly influences this correlation. Cancer appears to be the death cause with the strongest negative correlation. This could be used for further analysis and model training considering a specific death cause.

This chart could be interesting, if we could extend it and visualize the death rate per state and year.

Also for additional analysis: How is the household income developing year by year in the different states? Are there high variances? Is there a relationship to the death rates?

This graph shows us that heart disease in Arizona and cancer in South Dakota are by far the top death causes regarding death rate. We remember that cancer has a really strong negative correlation with the median income.

Further analysis: How is the median income distributed year by year in South Dakota?

https://altair-viz.github.io/gallery/top_k_items.html

Data has to be customized to get a good visualization here.

10 Map does not work! Reason unknown and still to be found out.

10.1 Model ideas

The following models could be interesting, because the models we have chosen were not ideal:

  • Polynomial Regression, in case the relationship between predictor and response variable is not linear
  • Bayesian Regression
  • Decision Tree Regression, mainly xgboost
  • Gradient Descent Regression

In order to find the best performing model, they could be compared using a specific metric (e.g. R², Mean Squared Error or Mean Absolute Error)

10.2 Additional model result analysis

Linear Regression

Cross Validation

count mean std min 25% 50% 75% max
lr 5.0 7462.394258 1197.570953 6063.244422 6370.508411 7715.406685 8485.279789 8677.531982
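A sketch of how such per-fold error values can be produced with cross_val_score; the data here is synthetic, and note that sklearn returns negated values for error-based scorings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data for the regression problem
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=50)

# 5-fold cross validation scored by (negated) mean squared error
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
mse_per_fold = -scores  # flip the sign to get the MSE per fold
```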

Coefficients

Name Coefficient
0 Intercept 1087.116
1 median_household_income -0.005

Plot

Plot

Lasso Regression

Cross Validation

count mean std min 25% 50% 75% max
lr 5.0 7507.486421 1198.093726 6070.625075 6469.607034 7727.331895 8566.590051 8703.278049

Coefficients

Name Coefficient
0 Intercept 802.619
1 state -0.007
2 median_household_income -46.177

This shows that the lasso regression in our case effectively relies on the same information to train the model: the state coefficient is shrunk almost to zero, leaving essentially only the median household income.

Plot